Current models for isolated word recognition perform very well on
small vocabularies of distinctly different sounding words. However, 
when we are confronted with vocabularies of similar sounding words 
(e.g., the letters of the alphabet), the performance of isolated word 
recognizers decreases dramatically. By carefully reexamining the 
model used for isolated word recognition we have identified some of 
the inherent deficiencies. In this paper we propose an improved
word-recognition model that is inherently capable of accurately
recognizing words from almost any vocabulary. We have investigated a simple
implementation of the model that preserves most of the structure of a
linear predictive coding (LPC)-based version of the canonic isolated
word model. In an experimental evaluation of the improved model,
using an alpha-digit vocabulary, recognition accuracy improvements
of 1 to 5.7 percent were obtained for four talkers. The
improvements were due to changes in both the analysis model and the
decision procedure. The strengths and weaknesses of the improved 
model are discussed. 

I. INTRODUCTION 

Although the goal of continuous speech recognition by machine 
remains far out of reach, the one area of speech recognition that is 
practical with today's technology and understanding is that of isolated 
word recognition. 1-6 What is interesting about this area is that the
general approach used to solve the isolated word-recognition problem
(i.e., the statistical-pattern-recognition approach) bears little
relationship to the way in which humans understand speech. As a result the
vocabularies for which the isolated word recognizers can achieve good
performance are severely constrained in both size and complexity. If
we are interested in using a vocabulary for which the performance of
the isolated word recognizer is less than perfect (e.g., the letters of the
alphabet), then we have to rely on the syntax and semantics of the
recognition task to provide the desired level of performance from the
overall system. 8-10

In an effort to improve word-recognition accuracy for arbitrary 
vocabularies, we have re-examined the word-recognition model and 
proposed a somewhat more general structure. The proposed changes 
in the model include an improved feature analysis in which both long- 
time and short-time features are measured, and an improved decision 
box in which the two-pass decision rule of Rabiner and Wilpon 11 is
adapted to the speaker-trained case. 

The implementation of the improved word-recognition model, which 
we have studied, is based on the standard linear predictive coding 
(lpc) word recognizer as originally proposed by Itakura. 12 In an effort 
to retain as much of the original structure as possible, we have used 
the standard lpc analysis as the long-time features, and a new lpc 
analysis based on 15-ms frames as the short-time features. Experimen- 
tation with the improved model, using a 39-word vocabulary of the 
alphabet, the digits, and three command words in a speaker-trained 
mode, showed improvements in accuracy of 1 to 5.7 percent for
four talkers. An analysis of the results showed that the improved
feature analysis provided only small improvements in accuracy (from
0 to 1.3 percent), whereas the two-pass decision rule provided
somewhat larger improvements in accuracy (from 1 to 4.4 percent).

The outline of this paper is as follows. In Section II we briefly review 
the canonic isolated word-recognition model and discuss its strengths 
and weaknesses. We also discuss, in this section, the implementation 
of the model based on lpc feature analysis and an lpc distance 
measure. In Section III we present the improved word-recognition 
model and discuss how it was implemented within the structure of the 
LPC-based recognizer. In Section IV we describe the experimental 
evaluation of the improved model based on the alpha-digit vocabulary. 
Finally, in Section V we discuss the results and their implications for 
practical systems. 

II. THE CANONIC MODEL FOR ISOLATED WORD RECOGNITION 

Fig. 1 — Block diagram of standard isolated word-recognition model.

Figure 1 is a block diagram of the canonic (statistical-pattern-recognition)
model for isolated word recognition. The three basic components
of the model include:

(i) Feature measurement in which the speech signal is analyzed to
provide a set of Q features (e.g., filter bank energies, LPC coefficients,
etc.) once every M samples. If the isolated word is of duration L × M
samples, then a total of L sets of features characterize the word. The
matrix of Q × L features is called the test pattern.

(ii) Pattern similarity measurement in which a score (similarity or 
distance) relating the similarity of the test pattern to each of a set of 
V reference patterns is computed. Pattern similarity involves both 
time alignment (registration) of the test and reference pattern, and 
distance computation along the alignment path. The output of the 
pattern similarity box is a set of V distance scores, i.e., one for each 
reference pattern. 

(iii) A decision rule in which the distance scores are used to provide
an ordered (by distance) list of recognition candidates. Generally, the 
candidate with the smallest distance is chosen as the "recognized 
word." 

Rather than dwelling further on the canonic model we now review 
the lpc implementation of this model, as we will be relying on this 
structure throughout this paper. We will return to the canonic model 
in Section 2.2 when we discuss its limitations and propose the improved 
model. 

2.1 The LPC-based implementation of the word recognizer 

Figure 2 is a block diagram of the feature measurement for an lpc- 
based analyzer. The digitized speech signal (digitized at a 6.67-kHz 
rate) is first preemphasized using a simple first-order digital network 
and then blocked into overlapping frames of N (300) samples with 
consecutive frames overlapping by 200 samples. Thus, a frame spacing 
of M = 100 samples is used (i.e., 67 frames/second). Each speech frame 
is then windowed by a 300-sample Hamming window, and a pth-order 
(p = 8) autocorrelation analysis is performed. A full lpc analysis 
(using the autocorrelation method 15 ) is then performed giving the set 
of (p + 1) lpc coefficients as the features for each frame. 
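
The analysis chain described above (preemphasis, 300-sample frames every 100 samples, Hamming window, 8th-order autocorrelation analysis) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the preemphasis coefficient 0.95 and all function names are hypothetical, and the Levinson-Durbin recursion is used as one standard way of solving the autocorrelation normal equations.

```python
import math

def preemphasize(x, mu=0.95):
    """First-order preemphasis y[n] = x[n] - mu*x[n-1]; mu is an assumed value."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]

def frames(x, N=300, M=100):
    """Overlapping frames of N samples, one frame every M samples."""
    return [x[s:s + N] for s in range(0, len(x) - N + 1, M)]

def hamming(N):
    """N-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def autocorrelation(frame, p=8):
    """Autocorrelation lags 0..p of one windowed frame."""
    N = len(frame)
    return [sum(frame[n] * frame[n + k] for n in range(N - k)) for k in range(p + 1)]

def levinson_durbin(r, p=8):
    """Solve the autocorrelation normal equations; returns [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * p
    err = r[0]
    for i in range(1, p + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g., all zeros): stop early
        k = -sum(a[j] * r[i - j] for j in range(i)) / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def lpc_features(signal, N=300, M=100, p=8):
    """One set of (p + 1) LPC coefficients per frame."""
    w = hamming(N)
    feats = []
    for f in frames(preemphasize(signal), N, M):
        windowed = [s * wn for s, wn in zip(f, w)]
        feats.append(levinson_durbin(autocorrelation(windowed, p), p))
    return feats
```

With the default arguments, a 500-sample signal yields three frames of nine coefficients each, the first coefficient always being 1.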

Fig. 2 — Block diagram of LPC analysis system.

The pattern similarity processing is carried out using a dynamic
time-warping (DTW) algorithm in which the test pattern is simultaneously
time aligned with each reference pattern, and a distance along
the time-alignment path is computed. One of the major features of this
processing is the local distance measure, which relates the distance
between a frame of the test pattern and a frame of the reference
pattern, of the form 12

    d(T, R) = \log \left[ \frac{a_R V_T a_R^t}{a_T V_T a_T^t} \right]    (1)

where a_R and a_T are the LPC feature sets of reference and test,
respectively, and V_T is the autocorrelation coefficient set of the test.
The distance measure of eq. (1), called the lpc log-likelihood measure, 
can be computed using only (p + 1) multiplications and additions, and 
one logarithm. 12 Furthermore, the lpc distance of eq. (1) has been 
shown to have reliable and well understood statistical properties. 14,15 
In particular, if both a_R and a_T are derived from the same underlying
stationary random process, then d(T, R) is precisely χ² distributed
with p degrees of freedom. This statistical behavior of the LPC distance
holds for fricative sounds. For voiced speech, although the model is 
inexact on a frame-by-frame basis, the statistical properties are ap- 
proximately correct on a time-average basis. 
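
As an illustration of eq. (1), the following Python sketch evaluates the log-likelihood distance directly from the quadratic forms. It assumes that V_T is the Toeplitz matrix built from the test-frame autocorrelation lags; the function names are hypothetical, and the sketch does not use the faster (p + 1)-multiplication form mentioned above.

```python
import math

def quad_form(a, r):
    """a V a^t, with V the Toeplitz matrix built from autocorrelation lags r."""
    p1 = len(a)
    return sum(a[i] * a[j] * r[abs(i - j)] for i in range(p1) for j in range(p1))

def lpc_distance(a_ref, a_test, r_test):
    """Log-likelihood distance of eq. (1) between a reference and a test frame."""
    return math.log(quad_form(a_ref, r_test) / quad_form(a_test, r_test))
```

Because a_T minimizes the quadratic form for its own autocorrelation, the ratio is at least 1 and the distance is nonnegative; it is zero when the reference and test coefficient sets coincide.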

To compute the pattern similarity between the test and each reference
pattern using the DTW algorithm with the distance measure of eq.
(1), a solution to the minimization of

    D^* = \min_{w(n)} \sum_{n=1}^{NT} d(T_n, R_{w(n)})    (2)

must be found, where NT is the number of frames in the test, and w(n)
is the warping function relating frame n of the test to frame w(n) of
the reference. Efficient recursive procedures for solving eq. (2) have
been described in the literature. 12, 16-18
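
A minimal dynamic-programming sketch of the minimization in eq. (2) is given below. It is not the recursion of the cited references: the local constraint used here (the reference index advances by 0, 1, or 2 per test frame, with both endpoints pinned) is a simplifying assumption, and the function name `dtw` and the caller-supplied `dist` are hypothetical.

```python
def dtw(test, ref, dist):
    """Minimum accumulated distance over warping paths m = w(n).

    Assumed local constraint: the reference index advances by 0, 1, or 2
    for each test frame; the path starts at the first frames and ends at
    the last frames of both patterns.
    """
    INF = float("inf")
    NT, NR = len(test), len(ref)
    prev = [INF] * NR
    prev[0] = dist(test[0], ref[0])
    for n in range(1, NT):
        cur = [INF] * NR
        for m in range(NR):
            best = min(prev[max(0, m - 2):m + 1])  # predecessors m, m-1, m-2
            if best < INF:
                cur[m] = dist(test[n], ref[m]) + best
        prev = cur
    return prev[NR - 1]
```

Only one column of accumulated distances is kept at a time, since the recursion at test frame n needs only the results for frame n - 1.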

Finally, the decision box orders the distance scores for each reference
pattern and chooses either the reference with the minimum distance
(the nearest neighbor rule) or the reference whose average of the K
best scores (for multiple-template systems) is minimum (the K-nearest
neighbor rule) as the recognized word. When the recognized word is
unique (i.e., only a single reference gets a small distance score), this
simple decision rule is sufficient. However, for complex vocabularies,
several references generally achieve small distance scores, and reliable
recognition using the smallest distance cannot be achieved. In such
cases a two-pass decision rule 11 has been shown to increase accuracy
by deferring the final recognition decision to a discrimination analysis
in a second pass of the decision rule. This discrimination analysis has
been applied only to speaker-independent systems because of the
problems associated with obtaining appropriate word discrimination
weights. 11
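
The nearest neighbor and K-nearest neighbor rules described above can be sketched as follows. The data layout (a dictionary mapping each vocabulary word to the DTW distances of its templates) and the function name are assumptions made for illustration.

```python
def knn_decision(scores, K=1):
    """Rank words by the average of their K smallest template distances.

    scores: dict mapping word -> list of DTW distances, one per template.
    K = 1 gives the nearest neighbor rule.
    Returns a list of (word, distance) pairs, best candidate first.
    """
    def avg_k_best(ds):
        best = sorted(ds)[:K]
        return sum(best) / len(best)

    ranked = sorted((avg_k_best(ds), word) for word, ds in scores.items())
    return [(word, d) for d, word in ranked]
```

The two rules can disagree: a word with one unusually close template wins under K = 1, whereas a word whose templates are consistently close wins under K = 2.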

2.2 Strengths and limitations of the word-recognition model 

The strengths of the canonic word-recognition model of Fig. 1 are as 
follows: 

(i) It is invariant to different speech vocabularies, users, feature 
sets, pattern similarity algorithms, and decision rules. 
(ii) It is easy to implement. 

(iii) It works well in practice.
The weaknesses of the model include: 

(i) The feature analysis only adequately represents long-time sta- 
tionary events in the speech signal; nonstationary and transient events 
are only poorly represented. 

(ii) The model does not perform well for complex vocabularies with 
acoustically similar words. 

We now consider the first weakness of the model. By way of example 
Fig. 3 shows waveform plots of the beginning regions of two distinct 
words. Word 1 shows a silence followed by the onset of voiced speech. 
Word 2 shows a short (15 ms) transient of low-level, unvoiced speech 
(e.g., a plosive sound) followed by the onset of voiced speech. Figure 
3 also shows the placement of the first two long-time speech segments 
(frames), which contain identical data except for the first 15 ms of the 
first segment, in which one frame has silence and one frame has a short 
plosive. It should be clear that for a long-time analysis such as the lpc 
model of Section 2.1, the low-level differences in the first 15 ms of 
frame 1 will be swamped out by the high-level voiced speech in the 
last 30 ms of the frame. Thus, in a long-time stationary framework,
accurate recognition of differences between short transients and other
nonstationary regions (e.g., as occur during onsets and offsets of
voicing) is greatly limited. To ameliorate this weakness, the
feature-detection algorithm must be enhanced to include some
representation of short-time nonstationary events.

Fig. 3 — How a short transient in a word can be swamped out by a voiced region in the
long-time analysis model.

Consider now the second weakness of the model. The reason that 
acoustically similar words are easily confused is that the pattern- 
similarity measure (the dtw distance) gives equal weight to all frames 
of the word. For differentiating words of one equivalence class from 
words of another equivalence class this procedure is reasonable. How- 
ever, within a class of acoustically similar words a discrimination 
analysis rather than a straight recognition is required. Such an analysis 
has been proposed by Rabiner and Wilpon 11 for the case of speaker- 
independent recognition of words. 

For speaker-trained recognizers this two-pass decision rule must be
modified so that the optimal weighting curves for word discrimination
can be obtained directly from the robust training procedure. 19

With the incorporation of the expanded feature analysis, a modified 
dtw algorithm, and an expanded decision rule, the basic weaknesses 
of the canonic word recognizer can be overcome to some extent. In the 
next section we describe an "improved" model for word recognition 
and show how the improvements can be incorporated directly into the 
lpc framework of Section 2.1. 

III. THE IMPROVED WORD-RECOGNITION MODEL 

Fig. 4 — Block diagram of the improved, isolated word-recognition model using both
long-time and short-time features, and a two-pass discrimination model.

Based on the discussion of Section 2.2, the improved word-recognition
model has a structure of the type shown in Fig. 4. The
major differences in the model, from that of Fig. 1, are:

(i) The feature measurement box is expanded into three subblocks,
namely long-time feature measurements, short-time feature
measurements, and a stationarity profile. The long-time features are
essentially those of the original model, although the rate at which they
are measured will generally be higher for this new model than for the
original model. The short-time features are intended to characterize
transients and other nonstationary events in the speech signal. Some
typical short-time features include zero or level crossing counts over
short-time intervals, wideband (short-impulse-response) filter bank
analyses, short-time LPC analyses, etc. The stationarity profile decides
which feature set (either long-time or short-time) is used to characterize
a given frame of speech, and hence is used for the distance measure
of the pattern-similarity algorithm.

(ii) The DTW algorithm is expanded to use both long-time and
short-time patterns, for both test and reference patterns, in determining
the similarity of a given reference pattern to the test pattern. The
stationarity profile is used to guide the alignment and to choose which
feature set is used in making a given distance computation.

(iii) The decision box is implemented as a two-pass decision. In the
first-pass decision the distance scores for each reference pattern are
ordered, and if the best distance is smaller than the second best
distance by at least a threshold T*, the decision phase is terminated. If,
however, the top two or more references are within T* in distance, a
second-pass decision rule is used in which the similar words are
compared using a discriminant analysis and the recognized word is
chosen on the basis of this analysis. To implement the discriminant
analysis, a set of distance-weighting curves discriminating word i from
word j (for all i, j) must be saved along with the reference patterns.

We now describe how the improved model was implemented in the
framework of the LPC analysis system.
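
The first-pass test of item (iii) can be sketched as follows. The function name, the dictionary layout, and the return convention (a decided word, or a confusion list to hand to the second pass) are assumptions made for illustration.

```python
def first_pass(distances, t_star):
    """First-pass decision.

    distances: dict mapping word -> first-pass DTW distance.
    Returns (word, []) when the best candidate is separated from all
    others by at least t_star, otherwise (None, confusion_list), where
    confusion_list holds every word within t_star of the best score.
    """
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    best_word, best_d = ranked[0]
    confusions = [w for w, d in ranked if d - best_d < t_star]
    if len(confusions) == 1:
        return best_word, []
    return None, confusions
```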

3.1 The LPC-based improved word-recognition model

Using the lpc analysis framework, the expanded feature measure- 
ment was implemented as follows. The long-time analysis was imple- 
mented as described in Section 2.1 except that the shift parameter, M, 
was changed from M = 100 to M = 33, and the analysis frame length, 
N, was changed from N = 300 to N = 297. Thus, for the long-time 
analysis, analysis frames were computed every 5 ms rather than every 
15 ms, thereby giving a frame rate three times larger. The analysis
frame was changed to 297 samples so as to be an integral multiple of
M, the shift parameter. We denote the long-time LPC features as T_{LT}.

For convenience the short-time analysis was implemented with the 
same processing (i.e., that of Fig. 2) as that of the long-time analysis, 
except that N was changed to 99 (15-ms analysis frames) and M was 
again set to 33 (5-ms frame shifts). The order of the lpc analysis was 
kept at 8 for the short-time as well as the long-time analysis. We
denote the short-time LPC features as T_{ST}.
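
The frame lengths and shifts quoted above follow from the 6.67-kHz sampling rate of Section 2.1. The short Python check below converts sample counts to milliseconds; the helper name is hypothetical.

```python
FS_HZ = 6670.0  # sampling rate quoted in Section 2.1

def samples_to_ms(n_samples, fs_hz=FS_HZ):
    """Duration in milliseconds of n_samples at sampling rate fs_hz."""
    return 1000.0 * n_samples / fs_hz

for name, n in [("original shift M", 100), ("new shift M", 33),
                ("long-time frame N", 297), ("short-time frame N", 99)]:
    print(f"{name} = {n} samples = {samples_to_ms(n):.2f} ms")
```

The values come out at roughly 15 ms, 5 ms, 45 ms, and 15 ms, respectively, and both frame lengths (297 and 99) are integral multiples of the 33-sample shift.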

To understand how the stationarity profile, fi, is generated within 
the framework of the lpc analysis, we must first define a characteri- 
zation of the types of speech segments that are encountered. For this 
purpose we define two binary features that characterize the source and 
the dynamics of the vocal tract. The first feature describes the
excitation for the frame of speech, and we denote voiced speech as V and
unvoiced speech as \bar{V}. The second feature describes the vocal tract
dynamics, and we denote the stationary, steady-state case as SS and
the nonstationary, time-varying case as \overline{SS}. Thus a given frame of
speech is characterized by the notation (V/\bar{V}, SS/\overline{SS}).

The determination of whether a frame is voiced or unvoiced is fairly 
straightforward and is readily obtained from any number of pitch- 
detection algorithms. The determination of whether a frame is station- 
ary or nonstationary is somewhat more complicated. This computation 
is made as follows. The basic idea is to compare both the long-time
and short-time features of frames j and i, where j represents the frame
occurring 15 ms before frame i. A distance comparing frames i and j is
made as

    \alpha_i = \frac{1}{4} \{ d[T_{LT}(i), T_{LT}(j)] + d[T_{LT}(j), T_{LT}(i)]
                            + d[T_{ST}(i), T_{ST}(j)] + d[T_{ST}(j), T_{ST}(i)] \}    (3)

i.e., the average of the long- and short-time LPC distances between
frames i and j and between frames j and i (recall that the LPC distance
is not symmetric). The distance score, \alpha_i, is then compared with a
threshold (different for voiced and unvoiced frames), and the
stationarity value is given as

    SS = \begin{cases}
        1 & \text{if } V \text{ and } \alpha_i < THV \\
        1 & \text{if } \bar{V} \text{ and } \alpha_i < THU \\
        0 & \text{otherwise,}
    \end{cases}    (4)

where 1 represents a stationary frame, 0 represents a nonstationary
frame, and THV and THU are voiced and unvoiced thresholds,
respectively.

Table I — Feature sets used for similarity determination

Test Frame Status           Feature Set    Frame Spacing    Speech Example
(V, SS)                     LT analysis    15 ms            Vowels, steady-state sounds
(V, \overline{SS})          LT analysis    5 ms             Onset, offset of voicing transitions
(\bar{V}, SS)               LT analysis    15 ms            Steady fricatives
(\bar{V}, \overline{SS})    ST analysis    5 ms             Transients
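
The stationarity measurement and decision of eqs. (3) and (4) can be sketched as follows. The sketch assumes that the LPC distance is supplied as a callable `d(x, y)` acting on whatever per-frame feature objects are in use, and that the symmetrized sum is normalized by 4, as the word "average" in the text suggests; the default thresholds are the example values THV = 0.2 and THU = 0.3 quoted later in this section. Function names are hypothetical.

```python
def stationarity_distance(d, lt_i, lt_j, st_i, st_j):
    """Average of the four directed distances between frames i and j.

    lt_*, st_*: long-time and short-time features of frames i and j,
    where frame j occurs 15 ms before frame i.
    """
    return 0.25 * (d(lt_i, lt_j) + d(lt_j, lt_i) + d(st_i, st_j) + d(st_j, st_i))

def is_stationary(alpha, voiced, thv=0.2, thu=0.3):
    """Return 1 (stationary) or 0 (nonstationary) for one frame."""
    threshold = thv if voiced else thu
    return 1 if alpha < threshold else 0
```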

Once a frame has been characterized with the two-feature code,
(V/\bar{V}, SS/\overline{SS}), the only remaining step is to specify which feature set and
frame spacing should be used in the DTW distance computation.

It should be clear that for voiced frames, (V, -), the long-time
analysis should be used to avoid potential bias caused by the pitch
period. Similarly, for all nonstationary frames, (-, \overline{SS}), a frame spacing
of 5 ms should be used to track the fast dynamics of such frames.
Finally, for unvoiced, nonstationary frames, (\bar{V}, \overline{SS}), the short-time
analysis is most appropriate to follow transients and other brief events.

Table I shows a summary of the feature sets and frame spacings, for 
each of the four types of frames, as used to determine word and 
reference template similarity. 
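
The selection summarized in Table I reduces to two independent choices, which the following sketch makes explicit. The function name and the string labels "LT" and "ST" are hypothetical.

```python
def select_analysis(voiced, stationary):
    """Return (feature_set, frame_spacing_ms) for one frame, per Table I."""
    # Short-time features are used only for unvoiced, nonstationary frames.
    feature_set = "ST" if (not voiced and not stationary) else "LT"
    # Nonstationary frames are tracked at the finer 5-ms spacing.
    spacing_ms = 15 if stationary else 5
    return feature_set, spacing_ms
```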

To illustrate the above analysis, Fig. 5 shows a series of plots of (a) 
the waveform, (b) the log energy (in dB), (c) the pitch, and (d) the 
average of long- and short-time lpc distance [eq. (3)] for the word 
/B/. It can be seen in Fig. 5d that the LPC distance becomes large at
the beginning of voicing (point A in the plot), at the termination of 
voicing (point B in the plot), and at the end of the word (point C in the 
plot). Such frames (and their neighborhoods) are the nonstationary 
regions of the word, and generally correspond well with transients, 
onset and offset of voicing, and rapidly varying vocal-tract dynamics. 

Fig. 6 — Histograms of values of (a) LPC distance for voiced speech, and (b) unvoiced
speech. Thresholds THV and THU are chosen to give desired percentages of
nonstationary classification.

To determine the stationarity thresholds, THV and THU, intelligently,
histograms of the behavior of \alpha_i for voiced and unvoiced frames
had to be measured. Such histograms are shown in Fig. 6. The data in
this figure were obtained by computing \alpha_i every 5 ms for all the frames
of the 39-word vocabulary of the letters of the alphabet, the digits, and
three command words. Based on the data of Fig. 6, values for THV and THU
can be chosen so as to obtain any desired average probability of
nonstationary classification for voiced or unvoiced frames. For example,
if we assume that, on average, only 10 percent of the voiced frames
should be classified as \overline{SS}, then
a threshold of THV = 0.2 should be used. Similarly, for unvoiced
frames a threshold of THU = 0.3 yields an average of 10 percent of the
frames being classified as nonstationary. If the thresholds, THV and
THU, are both set to infinity, then all frames are classified as stationary
and hence the feature analysis is essentially identical to that of the
original model. Similarly, if the thresholds are both set to zero, all
frames are classified as nonstationary and a 5-ms frame spacing is used
with both short- and long-time feature sets.
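
Choosing a threshold to obtain a desired fraction of nonstationary frames amounts to reading a quantile off the measured distribution of the stationarity distance. A minimal sketch, assuming the distances for one class of frames (voiced or unvoiced) are available as a list, is given below; the function name is hypothetical.

```python
def threshold_for_fraction(alphas, nonstationary_fraction=0.10):
    """Threshold such that about the given fraction of the measured
    stationarity distances lies at or above it (and is therefore
    classified as nonstationary)."""
    ordered = sorted(alphas)
    cut = int(round((1.0 - nonstationary_fraction) * len(ordered)))
    cut = min(max(cut, 0), len(ordered) - 1)
    return ordered[cut]
```

Applied separately to the voiced and unvoiced distances, this yields one threshold per class, in the same way that THV and THU are read from the two histograms.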

3.2 Modifications to the DTW algorithm for the improved word model 

As discussed above, the basic changes made in the feature measure- 
ment were inclusion of both short- and long-time lpc analyses, and an 
increase in the frame rate of the analysis from once every 15 ms to 
once every 5 ms. These analysis changes required some modifications 
to the dtw algorithm to properly handle the raw data structure. The 
modifications primarily involve reformulation of the local path
constraints to account for the different possible frame spacings (i.e.,
nonuniform sampling in time), and modifications to the distance 
computation to handle both long- and short-time lpc distances and 
their appropriate weights. 

We denote the long-time test pattern as {T_{LT}(n), n = 1, 2, ..., NT},
the short-time test pattern as {T_{ST}(n), n = 1, 2, ..., NT}, and
the stationarity distance (on which the stationarity profile is based) as
{\alpha_n, n = 1, 2, ..., NT}. Similarly, we denote the long-time reference
pattern as {R_{LT}(m), m = 1, 2, ..., NR} and the short-time reference
pattern as {R_{ST}(m), m = 1, 2, ..., NR}.

We wish to solve for the optimum warping path of the form m =
w(n), defined for values of n that satisfy either of the following
conditions:

    (n - 1) \bmod 3 = 0    (5a)

or

    \alpha_n > TH  or  \alpha_{n-1} > TH  or  \alpha_{n-2} > TH.    (5b)

Equation (5a) says we solve for m = w(n) at each standard 15-ms time
slot. This constraint essentially guarantees a grid spacing, between
adjacent DTW frames, of no more than three frames. It also guarantees
that, in the limit, as the entire word is classified as stationary, the new
analysis becomes identical to the previous analysis. Equation (5b) says
we solve for m = w(n) at each frame, n, in which the stationarity
distance, \alpha_n, of that frame or either of its two predecessors exceeds
the specified threshold, TH. (For voiced frames the threshold TH is
set to THV, and for unvoiced frames the threshold TH is set to
THU.) Cases in which eq. (5b) is satisfied (i.e., one of the distances is
above threshold) correspond to voiced frames with a rapidly changing
spectrum (transitions), or unvoiced frames with nonstationary
excitation.
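
The frame-selection rule of eq. (5) can be sketched as follows. The sketch assumes that the threshold applied at frame n is chosen by the voicing of frame n itself, which is one reading of the text; the function name and the default thresholds (the example values 0.2 and 0.3) are illustrative.

```python
def dtw_frames(alpha, voiced, thv=0.2, thu=0.3):
    """Return the 1-based test frame indices n at which the DTW recursion
    is solved.

    alpha[n-1] is the stationarity distance of frame n (5-ms frames) and
    voiced[n-1] is True for voiced frames.
    """
    selected = []
    for n in range(1, len(alpha) + 1):
        th = thv if voiced[n - 1] else thu
        on_grid = (n - 1) % 3 == 0              # eq. (5a): every 15 ms
        recent = alpha[max(0, n - 3):n]         # alpha_{n-2}, alpha_{n-1}, alpha_n
        if on_grid or any(a > th for a in recent):   # eq. (5b)
            selected.append(n)
    return selected
```

When every distance is below threshold the function returns only frames 1, 4, 7, and so on, which reproduces the original 15-ms analysis grid.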

For each frame n that satisfies one of the constraints of eq. (5) we
must solve the DTW recursion

    D_A(n, m) = w(n - n_L) d(n, m) + \min_{m'_L \le m_0 \le m'_H} [D_A(n_L, m_0)],    m_L \le m \le m_H,    (6)

where

n_L        = last value of n for which a DTW recursion was done.
\hat{n}_L  = next-to-last value of n for which a DTW recursion was done.
w(n - n_L) = weighting function on the local distance to account for the nonuniform frame spacing.
d(n, m)    = local frame distance for reference frame m and test frame n.
m'_L       = smallest value of m at n = n_L from which a valid path can go to the grid point (n, m).
m'_H       = largest value of m at n = n_L from which a valid path can go to the grid point (n, m).
m_L        = smallest value of m at frame n for which the DTW recursion is solved.
m_H        = largest value of m at frame n for which the DTW recursion is solved.

The values of m_L and m_H are determined from the global path
constraints, which specify that all valid DTW paths must lie within a
parallelogram defined from lines of slope 2 and slope 1/2 beginning at
grid point (0, 0) and ending at grid point (NT, NR). Thus, m_L and m_H
satisfy the path constraints

    m_L = \max[(n - 1)/2 + 1, \; 2(n - NT) + NR, \; 1]    (7a)

    m_H = \min[2(n - 1) + 1, \; (n - NT)/2 + NR, \; NR].    (7b)
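
The global range of eq. (7) can be evaluated as in the sketch below. Rounding the lower limit up and the upper limit down to integer frame indices is an assumption, since eq. (7) does not state how fractional values are handled; the function name is hypothetical.

```python
import math

def global_range(n, NT, NR):
    """Range (m_L, m_H) of reference frames allowed at test frame n
    (1-based indices), following eqs. (7a) and (7b)."""
    m_low = max(math.ceil((n - 1) / 2 + 1), 2 * (n - NT) + NR, 1)
    m_high = min(2 * (n - 1) + 1, math.floor((n - NT) / 2 + NR), NR)
    return m_low, m_high
```

For equal-length patterns the range collapses to a single point at both ends of the word and is widest near the middle.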

The values of m'_L and m'_H are those which guarantee that the path to
grid point (n, m) satisfies the local constraint that the average slope be
no less than one half nor more than 2. If we define a path increment
function, \Delta(m), as

    \Delta(m) = increment in m along the best path to grid point (n_L, m),

i.e., if the best path to grid point (n_L, m) comes from grid point [\hat{n}_L, m
- \Delta(m)], then values of m_0 in the DTW recursion [eq. (6)] must satisfy
the local path constraint

2300 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1 982 



    \frac{1}{2}(n - \hat{n}_L) \le \Delta(m_0) + (m - m_0) \le 2(n - \hat{n}_L).    (8)

Since \Delta(m_0) also satisfies the constraint

    \Delta(m_0) \le 2(n_L - \hat{n}_L),    (9)

we can rewrite the inequalities of eq. (8) as

    m'_H - \Delta(m'_H) \le m - \frac{1}{2}(n - \hat{n}_L)    (10a)

    m'_L \ge m - 2(n - n_L).    (10b)

Equation (10a) must be checked for each possible m value to find its
solution, whereas eq. (10b) can be used directly.
The weighting function w(n - n_L) is simply

    w(n - n_L) = (n - n_L)    (11)

to give more weight to longer frame separations, and the distance
d(n, m) is of the form

    d(n, m) = d[T_{LT}(n), R_{LT}(m)]    if (V, SS), (V, \overline{SS}), or (\bar{V}, SS)    (12a)

    d(n, m) = d[T_{ST}(n), R_{ST}(m)]    if (\bar{V}, \overline{SS}).    (12b)



The complicated form of the DTW recursion is due to the nonuniform
sampling rate at which the recursion is solved. If we translate eqs. (6)
through (12) into words we can say that for each frame n for which the
recursion is solved we compute D_A(n, m) for a range of m from m = m_L
to m = m_H, as determined by the global path constraints. For each m
the optimal path is determined as the weighted local distance,
d(n, m) w(n - n_L), (as determined by the stationarity profile at frame
n) plus the best accumulated distance to a predecessor frame that is a
valid candidate for a path to frame m (i.e., m'_L \le m_0 \le m'_H). The range
on m_0 is chosen to guarantee that the local path constraints of a
warping curve slope of between 1/2 and 2 are met. Since the number
of frames between the current frame n and the predecessor frame n_L,
for which the DTW recursion was last solved, is variable (ranging from
1 to 3), the local path constraints must use this range, along with
information as to how much the local path rose [\Delta(m_0)] at frame (n_L,
m_0), to set the local path constraints correctly.

The DTW recursion of eq. (6) is solved for all valid points from n =
1 to n = NT, and the total DTW solution is then given as

    D^* = D_A(NT, NR)    (13)

and the average path distance is

    \bar{D} = D^* / NT.    (14)

3.3 The improved decision rule 

As we discussed earlier a two-pass decision rule is used to improve 
recognition accuracy. The task of the first-pass decision rule is to 
determine the set of vocabulary words that are acoustically similar to 
the test word (i.e., the set of confusions). The task of the second-pass 
decision rule is then to resolve these confusions. 

The key idea behind the operation of the second-pass decision rule
is that the DTW distance scores between the test pattern and those
reference patterns that are acoustically close to each other and to the
test pattern consist of a χ² random component and a Gaussian random
component. The χ² random component is associated with the averaging
of distance scores between frames with the same basic spectrum, and
therefore has a χ² distribution with p degrees of freedom. The Gaussian
random component is associated with the averaging of large distance
scores between frames with dissimilar spectra.

In cases where the size of the dissimilar region is small (such as in
comparing a /B/ to a /D/) compared to the size of the similar region,
the χ² component of the distance often outweighs the Gaussian component,
thereby causing potential recognition errors.

The purpose of the second-pass decision rule is to enhance the role 
of the Gaussian component associated with spectrally dissimilar re- 
gions in determining the final decision. This is accomplished using a 
distance-weighting function that enhances the discrimination power of 
the frame-by-frame distance scores. 

By way of example, consider a simple confusion list of two references,
R_i and R_j, for test word T. Let the DTW frame-by-frame distance and
warping path be specified as

    d_k(n) = d{T(n), R_k[w_k(n)]}    (15)

and

    w_k(n) = warping path comparing frame n of the test with reference R_k.

We now define two distance-weighting functions,

    {W^{i,j}(n), n = 1, 2, ..., NR_i}

    {W^{j,i}(n), n = 1, 2, ..., NR_j},

where W^{i,j}(n) is the weighting to discriminate R_j from R_i, and W^{j,i}(n)
is the weighting to discriminate R_i from R_j. (Reference 11 shows that
these weighting functions are generally not symmetric.) We defer a
discussion of how the weights are generated, in a speaker-trained
system, to Section 3.4.

The basic hypothesis is that the test pattern, T, corresponds to
either R_i or R_j, and we wish to come up with a discrimination score
that aids in this decision. If we define a discrimination score,
\delta(T, R_i | T \in R_j), as the weighted distance between T and R_i, assuming
that T actually corresponds to R_j, then we get

    \delta(T, R_i | T \in R_j) = \frac{\sum_{k=1}^{NT} W^{i,j}[w_i(k)] \, d\{T(k), R_i[w_i(k)]\}}{\sum_{k=1}^{NT} W^{i,j}[w_i(k)]},    (16)

and similarly we get

    \delta(T, R_j | T \in R_i) = \frac{\sum_{k=1}^{NT} W^{j,i}[w_j(k)] \, d\{T(k), R_j[w_j(k)]\}}{\sum_{k=1}^{NT} W^{j,i}[w_j(k)]}.    (17)

The weighted distance corresponding to the hypothesis T \in R_j [i.e.,
eq. (16)] is shown in Fig. 7. The frame-by-frame distance is multiplied
by the weighting function reflected through the warping curve to give
the discrimination score \delta.

The discrimination distances of eqs. (16) and (17) have the following
important property. If T and R_i are from the same word (different
replications) then the frame-by-frame distances, d(·, ·), are all χ²
distributed (theoretically) and thus \delta(T, R_i | T \in R_j) will be
theoretically "independent" of the weighting function. If, however, T and R_j
(instead of R_i) are from the same word, then \delta(T, R_i | T \in R_j) will
reflect to a greater extent the Gaussian-distributed component of the
original distance score, d(T, R_i), since it primarily consists of distance
in regions where R_j and R_i differ significantly, even though they may
be quite short.

Thus, in the simple case of a confusion between two references, Ri 
and Rj, the final decision is made on the basis of the discrimination 
scores of eqs. (16) and (17). 
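
The weighted average of eq. (16) can be sketched as follows. The sketch assumes that the frame-by-frame distances along the warping path and the path itself (as 0-based reference frame indices) have already been produced by the DTW stage; the function and argument names are hypothetical.

```python
def discrimination_score(frame_dists, warp, weights):
    """Weighted average distance of eq. (16).

    frame_dists[k]: distance between test frame k and reference frame warp[k].
    warp[k]:        reference frame aligned with test frame k (0-based).
    weights[m]:     weighting curve W^{i,j}, indexed by reference frame.
    """
    num = sum(weights[warp[k]] * frame_dists[k] for k in range(len(frame_dists)))
    den = sum(weights[warp[k]] for k in range(len(frame_dists)))
    return num / den
```

When all frame distances are equal the score does not depend on the weights, whereas a weighting curve concentrated on one region returns the distance in that region alone, which mirrors the property described above.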

More generally, if the confusion list associated with test pattern T
has Q candidates, {R_{i_1}, R_{i_2}, ..., R_{i_Q}}, then the following procedure is
followed:

(i) Compute all pairs of discriminations

    \delta(T, R_{i_a} | T \in R_{i_b}),    b \ne a,    a, b = 1, 2, ..., Q.

ISOLATED WORD RECOGNITION 2303 



NR, 




W'-'lk) 



d,{k) 



FRAME-BY-FRAME 
DISTANCES 



Fig. 7 — The time warping plane of a test and reference pattern along with the distance 
of each frame and the weighting curve on distance. 

(ii) Form the average discrimination distance 



8(T, RO = 



Q - 1 6=1 



I 8(T,R ia \T=R ib ), a -1,2, ...,Q. 



(iii) Define the most likely candidate, Ri, as the candidate with the
minimum average discrimination distance, i.e.,

$$\delta_{\mathrm{MIN}} = \min_a \{\delta(T, R_{i_a})\}.$$

Similarly, a least likely candidate with maximum distance is defined
as

$$\delta_{\mathrm{MAX}} = \max_a \{\delta(T, R_{i_a})\}.$$

(iv) Given the original (i.e., first-pass) distance scores for all Q
candidates, $d(T, R_{i_a})$, with smallest distance $d_{\mathrm{MIN}}$
and largest distance $d_{\mathrm{MAX}}$, a second-pass set of distance
scores is computed that retains the second-pass ordering while using
the range of the first-pass distances. This procedure is illustrated in
Fig. 8. A reference with second-pass discrimination score δ(T, Ri) is
given distance d(T, Ri) by linearly interpolating along the line of
Fig. 8.
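
Steps (ii) through (iv) can be sketched as follows. This is an illustration only: it assumes that the line of Fig. 8 joins the points (δ_MIN, d_MIN) and (δ_MAX, d_MAX), which the text implies but does not state explicitly, and the function name, the handling of the degenerate case, and the numbers are hypothetical.

```python
def second_pass_distances(pairwise, first_pass):
    """Average the pairwise discriminations and map them to distances.

    pairwise[a][b] : delta(T, R_ia | T in R_ib) for b != a (diagonal ignored)
    first_pass[a]  : first-pass distance d(T, R_ia)
    Returns (average discrimination scores, second-pass distances).
    """
    q = len(first_pass)
    if q < 2:
        raise ValueError("need at least two candidates in the confusion list")

    # Step (ii): average over the Q - 1 competing hypotheses.
    avg = [
        sum(pairwise[a][b] for b in range(q) if b != a) / (q - 1)
        for a in range(q)
    ]

    # Step (iii): extremes of the average discrimination scores.
    delta_min, delta_max = min(avg), max(avg)
    # Step (iv): extremes of the first-pass distances.
    d_min, d_max = min(first_pass), max(first_pass)

    if delta_max == delta_min:
        # Assumption: with no second-pass ordering, keep the first-pass scores.
        return avg, list(first_pass)

    # Assumed line through (delta_min, d_min) and (delta_max, d_max).
    slope = (d_max - d_min) / (delta_max - delta_min)
    second = [d_min + (x - delta_min) * slope for x in avg]
    return avg, second


# Hypothetical three-candidate confusion list.
pairwise = [
    [0.0, 0.4, 0.6],
    [0.9, 0.0, 0.7],
    [1.2, 1.0, 0.0],
]
first_pass = [0.42, 0.40, 0.55]
avg, second = second_pass_distances(pairwise, first_pass)
best = second.index(min(second))
```

In this toy case the candidate ranked second on the first pass (index 0) has the smallest average discrimination score, so it receives the smallest second-pass distance and becomes the top choice.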

Fig. 8. Linear transformation between second-pass distance score and second-pass
discrimination score.

3.4 Determination of the weighting curves in the speaker-trained case

The determination of the weighting curves, $W^{i,j}$ and $W^{j,i}$, is
readily performed in the training phase for speaker-trained systems.
Given reference templates Ri and Rj, as obtained using the robust
training procedure of Rabiner and Wilpon, 19 a simple way of obtaining
$W^{i,j}$ is to warp Ri to Rj, giving



$$W^{i,j}(k) = d\{R_j(n), R_i[w_j(n)]\}, \tag{18}$$



where wj(n) denotes the warping path. Thus, the frame weights (W)
are essentially the frame-by-frame warped DTW distances between the
reference templates. Figure 9 shows weighting functions for references
corresponding to the words /I/ and /Y/. When compared with the
speaker-independent weights of Rabiner and Wilpon, 11 we immediately
see the statistical effects of small samples. It is evident that the
curves of Fig. 9 need some smoothing to reduce the statistical variance.
The result of applying a 3-point smoother (a triangular window) to
the data of Fig. 9 is given in Fig. 10. A good deal of the statistical
variation in the curves is smoothed out.
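
A 3-point triangular smoother of the kind described above might be sketched as follows. The tap values (1/4, 1/2, 1/4) and the choice to leave the two end points unchanged are assumptions made for this illustration; the text specifies only that the window is triangular and three points long.

```python
def smooth3(curve):
    """Apply a 3-point triangular window with assumed taps (1/4, 1/2, 1/4).

    The first and last samples are copied unchanged (an assumption about
    edge handling); every interior sample is replaced by the weighted
    average of itself and its two neighbors.
    """
    n = len(curve)
    if n < 3:
        return list(curve)
    smoothed = [curve[0]]
    for k in range(1, n - 1):
        smoothed.append(0.25 * curve[k - 1] + 0.5 * curve[k] + 0.25 * curve[k + 1])
    smoothed.append(curve[-1])
    return smoothed


# Hypothetical spiky weighting curve.
raw = [0.0, 4.0, 0.0, 0.0, 8.0, 0.0]
smoothed = smooth3(raw)
```

Isolated spikes in `raw` are halved and spread onto their neighbors, which is the kind of variance reduction seen when comparing Fig. 9 with Fig. 10.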

Fig. 9. Typical weighting curves for (a) I vs Y, and (b) Y vs I, derived from the
robust training tokens.

An alternative, more statistically meaningful, way of obtaining
smoother weighting curves is to use all P replications of each word in
the training set to determine the weights. Basically, we obtain a
weighting function for each pair of training tokens such that each
token is close in distance to the appropriate reference. The final
weighting curve is then obtained by averaging the individual weighting
curves, with appropriate time alignments. We use the term subweights
to denote the set of weights obtained by averaging all training tokens,
and we use the notation S to refer to this set. Figure 11 illustrates the
(sub)weighting curves for I, Y comparisons based on a set of five
training tokens for each word.

IV. EXPERIMENTAL EVALUATION OF THE IMPROVED MODEL 

Fig. 10. Smoothed weighting curves for (a) I vs Y, and (b) Y vs I, derived from the
robust training tokens and a 3-point smoother.

To measure the performance of the improved, LPC-based, isolated
word-recognition model, a small evaluation test was performed. Each
of four talkers (two male, two female, all experienced with
speech-recognition systems) trained the recognizer on a 39-word
alpha-digit vocabulary by saying each vocabulary word five times during
the course of a single training session. The word-reference patterns, the
normal discrimination weights, W, and subweights, S, were determined
from the training data using the robust training procedure of Rabiner
and Wilpon. 19

Fig. 11. Subweight curves for (a) I vs Y, and (b) Y vs I, derived from using all
training tokens.



For evaluation purposes the 39-word vocabulary was spoken 10 
additional times by each of the four talkers in two distinct recording 
sessions. Thus, a total of 390 words were used in each recognition test 
for each talker. 

4.1 Recognition test results

The overall results of the evaluation tests are given in Table II,
which shows recognition accuracy as a function of stationarity
thresholds, talker, and analysis condition. Three analysis conditions
are shown, namely Pass 1 alone (no discriminant analysis), Pass 2 with
weights, W, derived from single reference tokens, and Pass 2 with
subweights, S, derived from all reference tokens.

Table II. Recognition accuracies (percent) as a function of the
stationarity thresholds and the number of recognition passes for the
four talkers

                                                Talker Number
Condition                  (THU, THV)        1      2      3      4

Pass 1 Alone               (-∞, ∞)         94.9   94.9   90.5   86.7
                           (-0.3, 0.2)     95.4   94.9   91.8   86.4
                           (0, 0)          94.9   94.9   91.5   85.4

Pass 2 With Weight W       (-∞, ∞)         96.7   95.6   94.1   87.2
                           (-0.3, 0.2)     96.4   95.6   94.9   88.5
                           (0, 0)          95.6   95.4   94.4   86.4

Pass 2 With Subweight S    (-∞, ∞)         95.6   95.9   95.4   87.2
                           (-0.3, 0.2)     95.4   95.4   96.2   87.9
                           (0, 0)          95.4   95.6   95.1   87.2

The results of using Pass 1 alone show only a 0.4-percent
improvement, on average, in recognition accuracy for the four talkers
when comparing the old stationary model (where THU = -∞, THV = ∞)
with the new stationary model (where THU = -0.3, THV = 0.2).

The results of using Pass 2 with weights W show an average of 2.1- 
percent improvement in recognition accuracy for the four talkers over 
the old stationary model (when THU = -0.3 and THV = 0.2). When 
subweights S are used in Pass 2, the improvement in recognition 
accuracy is an average of 2 percent. 

Table II also shows that when Pass 2 is used the recognition accuracy
with stationarity thresholds set to (-0.3, 0.2) is, on average, about 0.5
percent higher than with stationarity thresholds set to (-∞, ∞). This
result indicates that the improved model provides a consistent
recognition accuracy improvement of about 0.5 percent, with or without
the second-pass weights.
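
The average improvements quoted above can be checked directly from the Table II entries. The short sketch below does this for the (-∞, ∞) and (-0.3, 0.2) threshold settings; the dictionary layout and variable names are illustrative choices, while the numbers are transcribed from the table.

```python
# Accuracies (percent) from Table II, listed for talkers 1 through 4.
OLD = "(-inf, inf)"     # old stationary model
NEW = "(-0.3, 0.2)"     # improved model thresholds

TABLE_II = {
    "pass1": {
        OLD: [94.9, 94.9, 90.5, 86.7],
        NEW: [95.4, 94.9, 91.8, 86.4],
    },
    "pass2_W": {
        OLD: [96.7, 95.6, 94.1, 87.2],
        NEW: [96.4, 95.6, 94.9, 88.5],
    },
    "pass2_S": {
        OLD: [95.6, 95.9, 95.4, 87.2],
        NEW: [95.4, 95.4, 96.2, 87.9],
    },
}


def mean(values):
    """Arithmetic mean over the four talkers."""
    return sum(values) / len(values)


# Baseline: Pass 1 alone with the old stationary model.
baseline = mean(TABLE_II["pass1"][OLD])

gain_pass1 = mean(TABLE_II["pass1"][NEW]) - baseline
gain_w = mean(TABLE_II["pass2_W"][NEW]) - baseline
gain_s = mean(TABLE_II["pass2_S"][NEW]) - baseline

print(f"Pass 1 alone:        {gain_pass1:.2f}")
print(f"Pass 2, weights W:   {gain_w:.2f}")
print(f"Pass 2, subweights S: {gain_s:.2f}")
```

The three gains come out at roughly 0.4, 2.1, and 2.0 percentage points, in agreement with the averages stated in the text.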

V. DISCUSSION 

The results presented in Section IV are both encouraging and 
discouraging. They are encouraging in that real improvements in 
recognition accuracy were obtained when a nonstationary analysis 
framework was used in place of the purely stationary framework used 
in earlier work. They are discouraging in that the average improvement 
resulting from the nonstationary model (0.5 percent) was considerably 
smaller than the average improvement resulting from the discrimina- 
tion analysis of the second pass (1.6 percent). 

There are several points worth noting that have bearing on the
discussion and results of this paper. The first concerns the anticipated
improvement in performance for the improved word-recognition
model. If one carefully considers the sources for recognition errors
with the alpha-digit vocabulary, it should become clear that the
anticipated improvement resulting from the nonstationary analysis
should be small unless some extra weighting is applied to the
nonstationary regions. This is because words that are strongly affected
by the nonstationary analysis (e.g., p, d, t, k, etc.) are easily confused
with similar words in the vocabulary (e.g., b, v, g, a, etc.), and since the
nonstationary regions are only a small subset of the word patterns, the
improved analysis will be swamped out by the word-similarity regions.
This is the original motivation for the discriminant analysis model
used in the two-pass word recognizer. 11 Hence, the results of Section
IV, which show a small (but consistent) gain for the improved analysis
model and a somewhat larger gain for the discriminant model, are
entirely consistent with the anticipated results given above.

A second point of note is that the implementation of the improved
word model was chosen for convenience rather than because it followed
naturally from the theory. Thus, the short-time features were
LPC coefficient sets derived from a short-time window. This
implementation was straightforward and required only minimal
modification of the recognizer structure. A more reasonable
implementation of the short-time analysis in the model would have been
something like a filter bank model, or a basilar membrane model. Such
features would then have complemented the long-time LPC features and
would have provided a better vehicle for testing and evaluating the
improved model. The problem with using these alternative short-time
feature sets is that there is no simple way of combining LPC and filter
bank (or basilar membrane model) features and deriving from them a
distance measure with good physical properties. The problem of
combining LPC and energy features has already been investigated by
Brown and Rabiner, 20 and it was shown that no simple metric existed
even for such a simple case. The main point in the above discussion is
that the small gain of the improved word model is more impressive when
one considers the simplicity of the short-time analysis used to provide
the performance gain.

The third point of note is the fact that the simple weighting derived
from the robust training procedure seemed to provide the same
performance improvement as the more sophisticated weighting obtained
by using multiple tokens in obtaining the weights. The obvious
conclusion to be drawn from the result is that the gain obtained from the
second pass (which is due primarily to small regions of extreme spectral
difference) is manifested in any pair of training tokens and that simple
smoothing (to eliminate statistical variability) is as good as using
multiple tokens.

When one takes into consideration all of the above points, the results
of Section IV provide a sound basis for believing that the improved
word-recognition model is a reasonable one and that both the
nonstationary analysis of the first pass and the discrimination analysis
of the second pass provide real performance gains.

VI. SUMMARY 

An improved word-recognition model was proposed in which the
standard long-time analysis features of the model are combined with
a set of short-time analysis features. A stationarity index is also
computed for each speech frame indicating which set of features
(long-time or short-time) best characterizes the current frame of speech.
Appropriate modifications to the DTW algorithm were required to
handle the enhanced analysis feature set. Also incorporated in the
recognition model was a speaker-trained version of the discriminant
analysis, two-pass model proposed by Rabiner and Wilpon. 11

An evaluation of the model based on an LPC implementation of both
long-time and short-time feature sets showed that the overall improved
word model gave a 1- to 5.7-percent improvement in recognition
accuracy across four experienced users of speech-recognition systems
using an alpha-digit word vocabulary. On average, the nonstationary
feature set alone led to a 0.5-percent improvement in accuracy, whereas
the two-pass discriminant analysis alone led to a 1.6-percent average
improvement in accuracy. The two improvements were almost
independent and the overall recognizer had, on average, a 2.1-percent
improvement in word accuracy.

The above results are considered encouraging and indicate that the 
improved model should be considered with alternative short-time 
feature sets. 